In this post we lay out the first foundational concept involved in the pursuit of ideal nonlinearity - the notion of an organized catalog (or family) of nonlinear functions/features. This discussion centers on the introduction of popular kernel, feedforward neural network, and tree-based catalogs.
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'
%matplotlib notebook
# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
In the previous post we saw how easily general nonlinear functions can be injected into our framework for regression/classification, producing nonlinear supervised learners. There we determined the proper form of our nonlinear features by visually inspecting the data - but rarely can we identify a complete nonlinearity / all features of a dataset this way ('by eye'). More often than not in practice, with modern datasets the input dimension $N$ is far too large for us to visualize the data at all. Moreover, even when we can visualize a dataset it is not usually the case that we can identify a proper nonlinearity (or set of nonlinearities) by eye.
For example, what feature(s) would you use for the regression dataset below? And for the classification dataset shown beside it? It is not very clear in either case: whatever functions underlie these datasets, they do not appear to be familiar elementary functions (like a sine, tanh, or polynomial) or small sums of such functions.
What should we do then in general to identify proper nonlinearities? Better yet - what can we do?
Since we cannot determine proper nonlinear features ourselves, we really have no other choice (unless we just give up): we must try out various combinations of nonlinear functions/features and see which works best. In other words, we take various collections of nonlinear functions/features $f_1,\,f_2,...,f_B$ and form their weighted combination as our predict function
\begin{equation} \text{predict}\left(\mathbf{x}, \omega \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right). \end{equation}After tuning all the parameters (for either the regression or classification task) we can then measure the model's performance. By trying out various combinations of different functions we can hope - if we do this efficiently - to find a combination that provides strong performance for a given dataset.
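As a minimal sketch of this idea (the feature choices below are hypothetical, not taken from any dataset in this post), the predict function is just a weighted sum over a list of feature functions:

```python
import numpy as np

def predict(x, w, feats):
    # weighted combination w_0 + w_1 f_1(x) + ... + w_B f_B(x)
    out = w[0]
    for w_b, f in zip(w[1:], feats):
        out = out + w_b * f(x)
    return out

# hypothetical example with B = 2 features: f_1(x) = x and f_2(x) = x**2
feats = [lambda x: x, lambda x: x**2]
w = np.array([1.0, 2.0, 3.0])
print(predict(0.5, w, feats))  # 1 + 2*(0.5) + 3*(0.25) = 2.75
```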
But this is opening a Pandora's Box - there are infinitely many nonlinear functions we could choose from, infinitely many combinations of such functions. So how do we do this effectively?
This is where popular phrases like neural networks, trees, and kernels come in. Each of these three jargon terms is in fact just a collection of similar-behaving nonlinear functions bundled together to form a distinct family or catalog. The diversity of functions inside each catalog is immense, and in this post we will introduce each catalog by describing a simple exemplar from it. Across the three catalogs, however, there are distinct differences - ranging from the sorts of shapes their functions naturally take to how their parameters are tuned - that differentiate all the functions in one catalog from those of the others. Here we outline the biggest of these differences.
Before jumping into a discussion of organized families of nonlinear functions, here we illustrate how adding together random functions can fail to provide the sort of fine-grained nonlinear modeling we are after. We will explore fitting a combination of three nonlinear functions to the noisy sinusoidal dataset shown in the next Python cell.
# create instance of linear regression demo, used below and in the next examples
csvname = datapath + 'noisy_sin_sample.csv'
demo1 = nonlib.demos_part_2.Visualizer(csvname)
demo1.show_pts()
While we saw in the previous post that we could determine an appropriate nonlinearity, here suppose this is the sort of dataset whose nonlinearity we will have to find via searching over combinations of nonlinear features.
First, suppose we try out various combinations of the following three functions
\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = \text{sinc}(10x + 1) \end{equation}which we plot in the next Python cell.
# plot our features
demo1.plot_feats(version = 1)
We will try combinations of these three features in succession. That is, we will first use a predict function that is simply a weighted sum of $f_1$ alone

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x). \end{equation}
Remember here the notation $\omega$ denotes the entire set of weights used (here just $w_0$ and $w_1$). We fit this model to the given dataset by minimizing the corresponding Least Squares cost.
Next we will try a weighted combination of the first two elements $f_1$ and $f_2$
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) \end{equation}tuning these weights properly by fitting to the dataset.
Finally we try a weighted combination of all three features as
\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) + w_3\,f_3(x) \end{equation}tuning the weights by fitting to the dataset.
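Each of these three models is linear in its weights, so each can be fit with a single least squares solve. The following sketch uses synthetic noisy sinusoidal data as a stand-in for the csv file above (and NumPy's normalized sinc in place of the sinc feature); because the three models are nested, the optimal cost can only decrease as features are added:

```python
import numpy as np

np.random.seed(0)
x = np.random.uniform(-1, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)  # synthetic stand-in data

feats = [lambda x: x, lambda x: x**2, lambda x: np.sinc(10 * x + 1)]

costs = []
for B in (1, 2, 3):
    # design matrix: a constant column plus the first B features
    F = np.column_stack([np.ones_like(x)] + [f(x) for f in feats[:B]])
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    costs.append(np.mean((F @ w - y) ** 2))
print(costs)  # non-increasing, since the three models are nested
```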
With the weights of all three models tuned, let's plot the resulting fits. In the left, middle, and right panels below we plot the dataset along with the fit from the first, second, and third models respectively.
# plots showing the first (left), second (middle), and third (right) model predictions
demo1.show_fits(version = 1)
Here none of the predictions are adequate. With the first two we have failed to fully capture the nonlinear nature of the data, and with the third we have greatly overestimated this nonlinearity.
Moreover, while the nonlinearity of the first two predictions changes fairly gradually, the leap in nonlinearity from the second to the third prediction is substantial. This is due to the fact that while the first two features - monomials of two successive degrees - are closely related, the third feature is drastically different from the first two. With such drastically different features we are unable to finely dial in the sort of nonlinearity we want.
If we swap out the third sinc feature for the next monomial term - the degree three monomial - giving the three features

\begin{equation} f_1(x) = x ~~~~~~~~~~~~~~ f_2(x) = x^2 ~~~~~~~~~~~~~~ f_3(x) = x^3 \end{equation}
we not only gain significantly more fine-grained control over how much nonlinearity we introduce into our prediction, but here can produce a much better result.
We plot all three of these features in the next Python cell.
# plot our features
demo1.plot_feats(version = 2)
And now - just as before - we use the first feature alone, then a linear combination of the first two, and finally a combination of all three to make predictions. In the next cell we plot the first, second, and third predictions in the left, middle, and right panels respectively.
# show first round of fits
demo1.show_fits(version = 2)
By using three related features - as opposed to including one odd-ball nonlinearity - we get a fantastic fit using all three features this time. Moreover, the change in nonlinearity is more gradual, predictable, and controllable across the three predictions.
This example highlights an important point: we need to be organized and thoughtful in our search for the right combination of features. Simply taking random combinations of various nonlinear functions as a predictor, e.g.,

\begin{equation} \text{predict}(x,\omega) = w_0 + w_1\text{tanh}\left(w_2 + w_3x\right) + w_4e^{w_5x} + w_6x^{10} + \cdots \end{equation}does not generally allow for fine-tuning of the nonlinearity in a prediction (nor efficient computation, as we will see). Conversely, if we restrict ourselves to a set of related functions we can better manage the amount of nonlinearity in our models.
The catalog of kernel functions consists of groups of functions with no internal parameters, a primary example being the polynomials. With just one input this sub-family of functions looks like

\begin{equation} f_1(x) = x, ~~ f_2(x) = x^2,~~ f_3(x)=x^3,... \end{equation}and so forth, with the $D^{th}$ element taking the form $f_D(x) = x^D$. A combination of the first $D$ units from this sub-family is often referred to as a degree $D$ polynomial. There are infinitely many of these functions - one for each positive whole number - and they are naturally ordered by degree (the higher the degree, the further down the list the function sits). Moreover, each element has a fixed shape - e.g., the monomial $f_2(x) = x^2$ is always a parabola, a cup facing upwards. In the next Python cell we plot the first four elements (also called units) of the polynomial sub-family of kernel functions.
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot the first 4 elements of the polynomial basis
demo.show_1d_poly()
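These units can also be generated programmatically. Below is a small sketch (a hypothetical helper, similar in spirit to `np.vander`) that stacks the first $D$ polynomial units as columns of a feature matrix:

```python
import numpy as np

def poly_features(x, D):
    # columns are the units f_d(x) = x**d for d = 1, ..., D
    x = np.asarray(x)
    return np.column_stack([x**d for d in range(1, D + 1)])

X = poly_features(np.array([1.0, 2.0, 3.0]), D=3)
print(X.shape)  # (3, 3)
print(X[1])     # [2. 4. 8.]
```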
In two inputs $x_1$ and $x_2$ polynomial units take the analogous form

\begin{equation} f_1(x_1,x_2) = x_1, ~~ f_2(x_1,x_2) = x_2^2, ~~ f_3(x_1,x_2) = x_1x_2^3, ~~ f_4(x_1,x_2) = x_1^4x_2^6,... \end{equation}with a general degree $D$ unit taking the form

\begin{equation} f_m(x_1,x_2) = x_1^px_2^q \end{equation}where $p$ and $q$ are nonnegative integers and $p + q \leq D$. A degree $D$ polynomial in this case is a linear combination of all such units. This definition generalizes to polynomial units in general $N$-dimensional input as well.
In the Python cell below we draw a sampling of these polynomial units.
# draw a few polynomial units in two inputs
demo.show_2d_poly()
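The complete set of degree $D$ units in two inputs can be enumerated directly. A sketch using `itertools` (exponent pairs $(p, q)$ with $0 < p + q \leq D$, excluding the constant):

```python
from itertools import product

def poly_units_2d(D):
    # all exponent pairs (p, q) with 1 <= p + q <= D
    return [(p, q) for p, q in product(range(D + 1), repeat=2) if 0 < p + q <= D]

units = poly_units_2d(2)
print(units)       # [(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
print(len(units))  # 5, i.e. (D + 1)(D + 2)/2 - 1 for D = 2
```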
In this example we animate predictions built from $B$ polynomial units, which in general take the form

\begin{equation} \text{predict}\left(\mathbf{x}, \omega \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right). \end{equation}fit to a variety of simple datasets. Notice in each case how - because these features come from the same sub-family of related nonlinear functions - the corresponding predictions change fairly gradually as additional units are added to the model (in other words, we can fine-tune the amount of nonlinearity we want in our prediction).
First, in the next cell we animate the final trained predictions from fitting $B = 1$ through $B = 4$ polynomial units to the noisy sinusoid dataset shown previously. As the slider moves from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data in the left panel. In the right panel we simultaneously show the Least Squares cost value of each trained model.
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,5)])
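The cost values traced out in the right panel can also be reproduced non-interactively. A sketch on synthetic noisy sinusoidal data (a stand-in for the csv file), fitting $B = 1$ through $B = 4$ polynomial units by least squares and recording the cost of each trained model:

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(-1, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(40)  # synthetic stand-in data

costs = []
for B in range(1, 5):
    # columns 1, x, x**2, ..., x**B
    F = np.column_stack([x**d for d in range(B + 1)])
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    costs.append(np.mean((F @ w - y) ** 2))
print(costs)  # non-increasing as units are added
```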
Next, in the following Python cell we animate the final trained predictions from fitting $B = 1$ through $B = 10$ polynomial units to a three-dimensional noisy sinusoid dataset. As the slider moves from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data. We do not plot the corresponding Least Squares cost value, but as in the example above it decreases as more polynomial units are added to the model.
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_units = [v for v in range(1,10)] ,view = [20,110],basis = 'poly')
And finally, an analogous example with two-class classification. As you move the slider from left to right additional polynomial units are added to the model. In this case each notch of the slider adds multiple units, since we are increasing the degree $D$ of the overall polynomial (which, for two inputs, means adding several individual polynomial units each time $D$ is increased).
# load in dataset http://math.arizona.edu/~dsl/
csvname = datapath + '2_eggs.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)
# run animator
demo.brows_single_fits(num_units = [v for v in range(0,4)], basis = 'poly',view = [30,-80])
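To make the "several units per notch" point concrete: in two inputs the number of units with $1 \leq p + q \leq D$ is $(D+1)(D+2)/2 - 1$, so stepping from degree $D-1$ to $D$ adds exactly $D + 1$ new units. A quick sketch:

```python
def num_units(D):
    # number of two-input polynomial units (p, q) with 1 <= p + q <= D
    return (D + 1) * (D + 2) // 2 - 1

for D in range(1, 5):
    added = num_units(D) - num_units(D - 1)
    print(D, num_units(D), added)  # each degree increment adds D + 1 units
```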
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot the first 4 elements of a single-hidden-layer tanh network basis
demo.show_1d_net(num_layers = 1,activation = 'tanh')
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot the first 4 elements of a single-hidden-layer relu network basis
demo.show_1d_net(num_layers = 1,activation = 'relu')
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tanh',num_elements = [v for v in range(1,5)])
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,12)] ,view = [20,110],basis = 'net')
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# plot the first 4 elements of the tree (stump) basis
demo.show_1d_tree(depth = 1)
demo = nonlib.stump_visualizer_2d.Visualizer()
csvname = datapath + 'noisy_sin_raised.csv'
demo.load_data(csvname)
demo.browse_stumps()
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tree',num_elements = [v for v in range(1,10)])
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,20)] ,view = [20,110],basis = 'tree')
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'sin_function.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,50,1)])
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,25)])
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,25)])
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='poly',num_elements = [v for v in range(1,10)],folds = 3)
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='tanh',num_elements = [v for v in range(1,10)],folds = 3)
